Getting Started Analyzing Twitter Data in Apache Kafka through KSQL

@machinelearnbot

KSQL is the open-source streaming SQL engine for Apache Kafka. It lets you do sophisticated stream processing on Kafka topics through a simple, interactive SQL interface. In this short article we'll see how easy it is to get up and running with a sandbox for exploring it, using everyone's favourite demo streaming data source: Twitter. We'll go from ingesting the raw stream of tweets, to filtering it with predicates in KSQL, to building aggregates such as the number of tweets per user per hour.


You'll probably get a screenful of results. This is because KSQL emits the aggregate value for the given hourly window each time that value is updated. Since we've set KSQL to read all messages on the topic (SET 'auto.offset.reset' = 'earliest';), it reads all of the existing messages at once and calculates the aggregation updates as it goes. Our inbound stream of tweets is just that: a stream. But now that we are creating aggregates, we have actually created a table. A table is a snapshot of a given key's values at a given point in time.
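
To make the stream/table distinction concrete, here is a minimal Python sketch that simulates an hourly tweet count per user. It is not KSQL and not how KSQL is implemented; the function name hourly_counts, the (user, timestamp) tuple layout, and the sample tweets are all assumptions made up for illustration. It shows why replaying a topic from the earliest offset prints many rows: every incoming message produces an updated aggregate, while the table holds only the latest value per key.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000  # size of one tumbling window, in milliseconds


def hourly_counts(tweets):
    """Simulate a per-user hourly count over (user, timestamp_ms) pairs.

    Returns two things:
      updates: one row per input message, i.e. the changelog a query would print
      table:   the latest count for each (user, window_start) key
    """
    table = defaultdict(int)
    updates = []
    for user, ts_ms in tweets:
        # Align the timestamp to the start of its hourly window.
        window_start = ts_ms - ts_ms % HOUR_MS
        key = (user, window_start)
        table[key] += 1
        # Each message changes one aggregate, so one updated row is emitted.
        updates.append((user, window_start, table[key]))
    return updates, dict(table)


# Hypothetical sample data: three tweets in the first hour, one in the second.
tweets = [
    ("alice", 0),
    ("bob", 1_000),
    ("alice", 2_000),
    ("alice", HOUR_MS + 5),
]

updates, table = hourly_counts(tweets)

for row in updates:
    print("update:", row)
print("table:", table)
```

Four messages yield four update rows, but the table ends up with only three keys, because the two first-hour tweets from the same user collapse into a single count. That is the same effect as the screenful of results described above.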